The Best 161 Multimodal Fusion Tools in 2025
CodeBERT Base
CodeBERT is a pre-trained model for programming languages and natural languages, based on the RoBERTa architecture, supporting tasks such as code search and code-to-documentation generation; a minimal embedding sketch follows this entry.
Multimodal Fusion
microsoft · 1.6M downloads · 248 likes
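Since CodeBERT follows the standard RoBERTa interface in Hugging Face Transformers, extracting a snippet-level embedding can be sketched as below; the example snippet and the CLS-pooling choice are illustrative assumptions, not the official code-search recipe.

# Minimal sketch: embed a code snippet with microsoft/codebert-base
# (assumes torch and transformers are installed; pooling choice is illustrative)
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

code = "def add(a, b): return a + b"  # hypothetical example snippet
inputs = tokenizer(code, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
# Take the first-token (CLS-position) hidden state as a rough snippet embedding
embedding = outputs.last_hidden_state[:, 0, :]
print(embedding.shape)  # torch.Size([1, 768])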
Llama 4 Scout 17B 16E Instruct
Other
Llama 4 Scout is a multimodal AI model developed by Meta, featuring a mixture-of-experts architecture and supporting text-and-image interaction in 12 languages, with 17B active parameters and 109B total parameters; a rough loading sketch follows this entry.
Multimodal Fusion
Transformers · Supports Multiple Languages
meta-llama · 817.62k downloads · 844 likes
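Recent Hugging Face Transformers releases expose a Llama4ForConditionalGeneration class for these checkpoints; a rough text-only chat sketch under that assumption is given below (gated model access, a recent transformers version, and enough GPU memory for the 109B-total-parameter weights are all assumed).

# Rough sketch: text-only chat with Llama 4 Scout via Transformers
# (assumes a transformers release with Llama 4 support, accepted gated access,
#  and hardware able to hold the full mixture-of-experts checkpoint)
import torch
from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user",
     "content": [{"type": "text", "text": "Summarize what a mixture-of-experts layer does."}]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))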
UniXcoder Base
Apache-2.0
UniXcoder is a unified multimodal pretrained model that leverages multimodal data such as code comments and abstract syntax trees for pretraining code representations.
Multimodal Fusion
Transformers · English
microsoft · 347.45k downloads · 51 likes
TITAN
TITAN is a multimodal whole slide foundation model pre-trained through visual self-supervised learning and vision-language alignment for pathology image analysis.
Multimodal Fusion
Safetensors · English
MahmoodLab · 213.39k downloads · 37 likes
Qwen2.5 Omni 7B
Other
Qwen2.5-Omni is an end-to-end multimodal model capable of perceiving various modalities such as text, images, audio, and video, and generating text and natural speech responses in a streaming manner.
Multimodal Fusion
Transformers · English
Qwen · 206.20k downloads · 1,522 likes
MiniCPM-o 2.6
MiniCPM-o 2.6 is a GPT-4o-level multimodal large model that runs on mobile devices, supporting vision, voice, and live-stream processing.
Multimodal Fusion
Transformers · Other
openbmb · 178.38k downloads · 1,117 likes
Llama 4 Scout 17B 16E Instruct
Other
Llama 4 Scout is a 17B-active-parameter, 16-expert multimodal AI model from Meta, supporting 12 languages and image understanding with strong reported performance.
Multimodal Fusion
Transformers · Supports Multiple Languages
chutesai · 173.52k downloads · 2 likes
Qwen2.5 Omni 3B
Other
Qwen2.5-Omni is an end-to-end multimodal model capable of perceiving various modalities including text, images, audio, and video, while synchronously generating text and natural speech responses in a streaming manner.
Multimodal Fusion
Transformers · English
Qwen · 48.07k downloads · 219 likes
One Align
MIT
Q-Align is a multi-task visual assessment model focusing on Image Quality Assessment (IQA), Image Aesthetic Assessment (IAA), and Video Quality Assessment (VQA), published at ICML 2024.
Multimodal Fusion
Transformers
q-future · 39.48k downloads · 25 likes
BiomedVLP BioViL-T
MIT
BioViL-T is a vision-language model focused on analyzing chest X-rays and radiology reports, enhancing performance through temporal multimodal pretraining.
Multimodal Fusion
Transformers · English
microsoft · 26.39k downloads · 35 likes
Chameleon 7b
Other
Meta Chameleon is a mixed-modal, early-fusion foundation model developed by FAIR, supporting interleaved processing of images and text; a rough prompting sketch follows this entry.
Multimodal Fusion
Transformers
facebook · 20.97k downloads · 179 likes
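Transformers ships dedicated Chameleon classes, so an interleaved image-plus-text prompt can be sketched roughly as follows; the image URL and question are placeholders, and preprocessing details may differ from the model card.

# Rough sketch: prompting facebook/chameleon-7b with an image and a question
# (assumes a transformers release with Chameleon support and a CUDA GPU;
#  the image URL and prompt are placeholder examples)
import requests
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

model_id = "facebook/chameleon-7b"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
prompt = "What is shown in this image?<image>"  # <image> marks where the image is fused in

inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, dtype=torch.bfloat16
)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))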
LLM2CLIP Llama 3 8B Instruct CC Finetuned
Apache-2.0
LLM2CLIP is an innovative approach that enhances CLIP's cross-modal capabilities through large language models, significantly improving the discriminative power of visual and text representations.
Multimodal Fusion
microsoft · 18.16k downloads · 35 likes
UniXcoder Base Nine
Apache-2.0
UniXcoder is a unified multimodal pretraining model that leverages multimodal data (such as code comments and abstract syntax trees) to pretrain code representations.
Multimodal Fusion
Transformers · English
microsoft · 17.35k downloads · 19 likes
Llama Guard 4 12B
Other
Llama Guard 4 is a native multimodal safety classifier with 12 billion parameters, jointly trained on text and multiple images for content safety evaluation of large language model inputs and outputs.
Multimodal Fusion
Transformers · English
meta-llama · 16.52k downloads · 30 likes
Spatialvla 4b 224 Pt
MIT
SpatialVLA is a spatially enhanced vision-language-action model trained on 1.1 million real robot manipulation episodes, focused on robot control tasks.
Multimodal Fusion
Transformers · English
IPEC-COMMUNITY · 13.06k downloads · 5 likes
Pi0
Apache-2.0
Pi0 is a general-purpose vision-language-action flow model for robot control tasks.
Multimodal Fusion
lerobot · 11.84k downloads · 230 likes
ColNomic Embed Multimodal 7B
Apache-2.0
ColNomic Embed Multimodal 7B is a state-of-the-art multi-vector multimodal embedding model that excels at visual document retrieval, with multilingual support and unified text-image encoding; a minimal scoring sketch follows this entry.
Multimodal Fusion · Supports Multiple Languages
nomic-ai · 7,909 downloads · 45 likes
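ColNomic-style multi-vector retrieval scores a query against a document page by late interaction: each query-token embedding is matched to its most similar page-patch embedding and the maxima are summed. The self-contained PyTorch sketch below shows only that scoring step, with random placeholder tensors standing in for the model's real outputs.

# Illustrative late-interaction (MaxSim) scoring for multi-vector embeddings,
# the retrieval style used by ColBERT/ColPali-like models such as ColNomic.
# Random tensors stand in for real query-token and page-patch embeddings.
import torch
import torch.nn.functional as F

def maxsim_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    """query_vecs: (num_query_tokens, dim); doc_vecs: (num_doc_patches, dim)."""
    q = F.normalize(query_vecs, dim=-1)
    d = F.normalize(doc_vecs, dim=-1)
    sim = q @ d.T                        # cosine similarity of every token/patch pair
    return sim.max(dim=-1).values.sum()  # best patch per query token, summed

query = torch.randn(16, 128)                         # e.g. 16 query-token vectors
pages = [torch.randn(1024, 128) for _ in range(3)]   # 3 candidate document pages

scores = torch.stack([maxsim_score(query, p) for p in pages])
print(scores.argmax().item())  # index of the best-matching page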
Llama 4 Scout 17B 16E Linearized Bnb Nf4 Bf16
Other
Llama 4 Scout is a 17-billion-parameter Mixture of Experts (MoE) model released by Meta, supporting multilingual text and image understanding with a linearized expert module design for PEFT/LoRA compatibility.
Multimodal Fusion
Transformers · Supports Multiple Languages
axolotl-quants · 6,861 downloads · 3 likes
CogACT Base
MIT
CogACT is a novel Vision-Language-Action (VLA) architecture that combines vision-language models with specialized action modules for robotic manipulation tasks.
Multimodal Fusion
Transformers · English
CogACT · 6,589 downloads · 12 likes
Llama 4 Maverick 17B 128E Instruct FP8
Other
A native multimodal AI model in the Llama 4 series that supports text and image understanding, uses a mixture-of-experts architecture, and is suitable for commercial and research use.
Multimodal Fusion
Transformers · Supports Multiple Languages
RedHatAI · 5,679 downloads · 1 like
ColNomic Embed Multimodal 3B
ColNomic Embed Multimodal 3B is a 3-billion-parameter multimodal embedding model specifically designed for visual document retrieval tasks, supporting unified encoding of multilingual text and images.
Multimodal Fusion · Supports Multiple Languages
nomic-ai · 4,636 downloads · 17 likes
Llama Guard 3 11B Vision
A multimodal content-safety classifier fine-tuned from Llama-3.2-11B, optimized for detecting harmful mixed text-and-image content.
Multimodal Fusion
Transformers · Supports Multiple Languages
meta-llama · 4,553 downloads · 60 likes
DSE Qwen2 2B MRL V1
Apache-2.0
DSE-QWen2-2b-MRL-V1 is a dual-encoder model designed to encode document screenshots into dense vectors for document retrieval; a minimal ranking sketch follows this entry.
Multimodal Fusion · Supports Multiple Languages
MrLight · 4,447 downloads · 56 likes
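In contrast to multi-vector late interaction, a dual encoder such as DSE condenses each query and each document screenshot into a single dense vector and ranks candidates by cosine similarity. The model-agnostic sketch below shows only that ranking step, with placeholder vectors in place of real encoder outputs.

# Illustrative single-vector dense retrieval: one embedding per query and per
# document screenshot, ranked by cosine similarity. Placeholder vectors are
# used here; a real pipeline would obtain them from the dual encoder.
import torch
import torch.nn.functional as F

query_emb = F.normalize(torch.randn(1, 1536), dim=-1)    # 1 query vector
doc_embs = F.normalize(torch.randn(100, 1536), dim=-1)   # 100 screenshot vectors

scores = (query_emb @ doc_embs.T).squeeze(0)             # cosine similarities
top = torch.topk(scores, k=5)
print(top.indices.tolist())                              # indices of the top-5 documents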
BiomedCLIP ViT BERT HF
MIT
A BiomedCLIP implementation built on PyTorch and the Hugging Face frameworks, reproducing the original microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 model.
Multimodal Fusion
Transformers · English
chuhac · 4,437 downloads · 1 like
Ming Lite Omni
MIT
A lightweight unified multimodal model that efficiently processes images, text, audio, and video, and performs strongly in speech and image generation.
Multimodal Fusion
Transformers
inclusionAI · 4,215 downloads · 103 likes
Qwen2.5 Omni 7B GPTQ 4bit
MIT
A 4-bit GPTQ quantized version of the Qwen2.5-Omni-7B model, supporting multilingual and multimodal tasks.
Multimodal Fusion
Safetensors · Supports Multiple Languages
FunAGI · 3,957 downloads · 51 likes
Taxabind Vit B 16
MIT
TaxaBind is a multimodal embedding model that binds six modalities for ecological applications, supporting zero-shot classification of species images against taxonomic text labels; a minimal scoring sketch follows this entry.
Multimodal Fusion
MVRL · 3,672 downloads · 0 likes
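Zero-shot species classification in a shared embedding space amounts to comparing one image embedding against a set of taxonomic text embeddings and taking a softmax over the similarities. The PyTorch sketch below shows only that scoring logic; placeholder embeddings and made-up taxa stand in for TaxaBind's actual encoders and labels.

# Illustrative CLIP-style zero-shot classification over taxonomic labels.
# Placeholder embeddings stand in for the image and text encoders of a
# model like TaxaBind; only the scoring step is shown.
import torch
import torch.nn.functional as F

labels = ["Bubo bubo", "Strix aluco", "Tyto alba"]             # hypothetical taxa
image_emb = F.normalize(torch.randn(1, 512), dim=-1)           # one species photo
text_embs = F.normalize(torch.randn(len(labels), 512), dim=-1) # one vector per taxon

logits = 100.0 * image_emb @ text_embs.T   # scaled cosine similarities
probs = logits.softmax(dim=-1).squeeze(0)
print(labels[probs.argmax().item()], float(probs.max()))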
GR00T N1 2B
NVIDIA Isaac GR00T N1 is the world's first open-source foundation model for general humanoid robot reasoning and skills, with 2 billion parameters.
Multimodal Fusion
nvidia · 3,631 downloads · 284 likes
Hume System2
MIT
Hume-System2 provides the pre-trained System-2 weights of a dual-system Vision-Language-Action (VLA) model, intended to speed up System-2 training and to support robotics research and applications.
Multimodal Fusion
Transformers · English
Hume-vla · 3,225 downloads · 1 like
LLaVE 0.5B
Apache-2.0
LLaVE is a multimodal embedding model based on the LLaVA-OneVision-0.5B model, with a parameter scale of 0.5B, capable of embedding text, images, multiple images, and videos.
Multimodal Fusion
Transformers · English
zhibinlan · 2,897 downloads · 7 likes
Libero Object 1
MIT
Hume-Libero_Object is a dual-system vision-language-action model trained on the Libero-Object dataset; it incorporates System-2 reasoning and is suited to robotics research and applications.
Multimodal Fusion
Transformers · English
Hume-vla · 2,836 downloads · 0 likes
Libero Goal 1
MIT
Hume-Libero_Goal is a Vision-Language-Action model built on dual-system thinking, designed for robot tasks and integrating System-2 reasoning to improve decision making.
Multimodal Fusion
Transformers · English
Hume-vla · 2,698 downloads · 1 like
RDT 1B
MIT
A 1-billion-parameter imitation-learning diffusion Transformer pretrained on over 1M multi-robot manipulation episodes, supporting multi-view vision-language-action prediction.
Multimodal Fusion
Transformers · English
robotics-diffusion-transformer · 2,644 downloads · 80 likes
OpenVLA 7B OFT Finetuned Libero Spatial
MIT
OpenVLA-OFT is an optimized vision-language-action model that substantially improves the inference speed and task success rate of the base OpenVLA model through an optimized fine-tuning recipe.
Multimodal Fusion
Transformers
moojink · 2,513 downloads · 3 likes
Llama 4 Scout 17B 16E Unsloth Bnb 4bit
Other
Llama 4 Scout is a multimodal mixture-of-experts model developed by Meta, supporting 12 languages and image understanding, with 17 billion active parameters and a 10M context length.
Multimodal Fusion
Transformers · Supports Multiple Languages
unsloth · 2,492 downloads · 1 like
Omniembed V0.1
MIT
A multimodal embedding model based on Qwen2.5-Omni-7B, supporting unified embedding representations for cross-lingual text, images, audio, and video.
Multimodal Fusion
Tevatron · 2,190 downloads · 3 likes
Llama 4 Maverick 17B 128E Instruct FP8
Other
Llama 4 Maverick is a native multimodal AI model from Meta that uses a mixture-of-experts architecture, accepts text and image input, and outputs multilingual text and code.
Multimodal Fusion
Transformers · Supports Multiple Languages
chutesai · 2,019 downloads · 0 likes
Llama 4 Scout 17B 16E Unsloth Dynamic Bnb 4bit
Other
Llama 4 Scout is Meta's 17-billion-parameter mixture-of-experts multimodal model, supporting 12 languages and image understanding.
Multimodal Fusion
Transformers · Supports Multiple Languages
unsloth · 1,935 downloads · 2 likes
Llama 4 Scout 17B 16E Instruct INT4
Other
The Llama 4 series consists of native multimodal AI models from Meta that adopt a mixture-of-experts architecture, support text and image interaction, and perform well across a range of language and vision tasks.
Multimodal Fusion
Transformers · Supports Multiple Languages
fahadh4ilyas · 1,864 downloads · 0 likes
Llama 4 Scout 17B 16E Instruct FP8
Other
The Llama 4 series consists of native multimodal AI models from Meta that support text and image interaction, adopt a mixture-of-experts architecture, and perform well in text and image understanding.
Multimodal Fusion
Transformers · Supports Multiple Languages
fahadh4ilyas · 1,760 downloads · 0 likes
Eagle X5 13B Chat
Eagle is a family of vision-centric, high-resolution multimodal large language models, supporting input resolutions above 1K and performing strongly on tasks such as optical character recognition and document understanding.
Multimodal Fusion
Transformers
NVEagle · 1,748 downloads · 28 likes
Llama Guard 3 11B Vision
A multimodal content-safety classification model based on Llama-3.2-11B, supporting detection of harmful text/image inputs and responses.
Multimodal Fusion
Transformers · Supports Multiple Languages
SinclairSchneider · 1,725 downloads · 1 like